Summary. Learning (statistical) dependencies among random variables requires a sample size that
grows exponentially in the number of observed random variables if arbitrary joint probability
distributions can occur.

We consider the case that sparse data strongly suggest that the probabilities can be described
by a simple Bayesian network, i.e., by a graph with small in-degree $\Delta$. Then this simple law
will also explain further data with high confidence. This is shown by calculating bounds on
the VC dimension of the set of those probability measures that correspond to simple graphs.
This allows one to select networks by structural risk minimization and gives reliability bounds on
the error of the estimated joint measure without (in contrast to a previous paper) any prior
assumptions on the set of possible joint measures.

The complexity of searching for the optimal Bayesian networks of in-degree $\Delta$ increases only
polynomially in the number of random variables for constant $\Delta$, and the optimal joint measure
associated with a given graph can be found by convex optimization.

1. Bayesian networks and the causal Markov condition

Learning statistical dependencies among a set of $n$ random variables $X_1, \dots, X_n$ is an
important tool of scientific research. Formally, the task of learning those dependencies is to
obtain some information about the joint probability measure $P$, where $P(x_1, \dots, x_n)$
denotes the probability of the event $X_1 = x_1, \dots, X_n = x_n$.† A useful way to represent such
information in a graphical way is given by the concept of Bayesian networks (Pearl, 1985).

Although one may consider Bayesian networks merely as a way of encoding statistical 
dependencies into a graph, the concept is better understood if a causal interpretation is 
given to the graph. Recall that every joint probability P can be factorized as 

$$P(x_1, x_2, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid x_1, \dots, x_{j-1}) ,$$

where $P(x_j \mid x_1, \dots, x_{j-1})$ are the conditional probabilities given the values
$x_1, \dots, x_{j-1}$ of $X_1, X_2, \dots, X_{j-1}$. Let $G$ be a directed acyclic graph. Assume that $G$
represents the underlying causal structure of $X_1, X_2, \dots, X_n$. An arrow from $X_j$ to $X_l$
indicates that $X_j$ influences $X_l$ directly (here "directly" means that the causal effect is not
mediated by another variable $X_m$).

†Here we assume that each random variable $X_j$ can only take values in a finite set $\Omega_j$.

Assume that the variables are ordered in a way that is consistent
with $G$, i.e., there is no arrow from any $X_j$ to a variable $X_l$ with $l < j$. In case the variables
$X_j$ correspond to definite and different times $t_j$, one may think of this order as the time
order $t_1 < t_2 < \dots < t_n$. Since each $X_j$ is only (directly) influenced by
its parents (the nodes with an arrow to $X_j$), we can give a simpler factorization for $P$ as
follows:

$$P(x_1, x_2, \dots, x_n) = \prod_{j} P(x_j \mid x_{j_1}, x_{j_2}, \dots, x_{j_{k_j}}) , \qquad (1)$$

where $X_{j_1}, \dots, X_{j_{k_j}}$ are the $k_j$ parents of $X_j$. It can be shown (Pearl, 2000) that this
factorization implies the so-called Markov condition defined as follows.

Definition 1. Let $P$ be a joint probability distribution of $n$ random variables
$X_1, X_2, \dots, X_n$ and $G$ a directed acyclic graph with the variables as nodes. Then $P$ is said
to satisfy the Markov condition relative to $G$ if for each variable $X_j$ the following condition
holds:

Given the values of all the parents of $X_j$, the variable $X_j$ is statistically independent
of the set of those nodes $X_l$ with $l \neq j$ that are not (direct or indirect) descendants of $X_j$.

Here statistical independence of two sets $\mathcal{X} := \{X_1, \dots, X_k\}$, $\mathcal{Y} := \{Y_1, \dots, Y_l\}$ of
variables given a third set $\mathcal{Z} := \{Z_1, \dots, Z_m\}$ is defined by the condition

$$P(x_1, \dots, x_k, y_1, \dots, y_l \mid z_1, \dots, z_m) = P(x_1, \dots, x_k \mid z_1, \dots, z_m) \, P(y_1, \dots, y_l \mid z_1, \dots, z_m)$$

for all possible assignments of the values $x_i, y_j, z_r$. Every graph $G$ defines a set of probability
distributions:

Definition 2. Let $G$ be an arbitrary directed acyclic graph on $n$ nodes labeled with the
random variables $X_1, \dots, X_n$. Then $\mathcal{P}_G$ is the set of all joint probability distributions of
$X_1, \dots, X_n$ that satisfy the Markov condition relative to $G$.

Given an arbitrary order on $X_1, X_2, \dots, X_n$ we define the complete acyclic graph $G_c$
corresponding to the order as the graph with an arrow from each $X_j$ to each $X_l$ with $l > j$.
Note that every probability measure is Markovian relative to $G_c$. Hence one can find graphs
$G$ such that $P$ is Markovian relative to $G$ by testing which edges can be removed from $G_c$
without violating the Markov property. A Bayesian network is then formally defined as the
pair $(G, P)$ where $G$ is a directed acyclic graph with random variables as nodes and $P$ a
joint probability measure satisfying the Markov property relative to $G$.
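The factorization of a joint measure along a graph can be illustrated with a minimal sketch. The three-node chain, the dictionary layout (`parents`, `cpt`) and all numerical values below are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from itertools import product

# Hypothetical network X1 -> X2 -> X3 over binary variables (0-based indices).
# parents[j] lists the parent indices of node j, consistent with the order.
parents = {0: [], 1: [0], 2: [1]}

# cpt[j][parent_values][x_j] = P(x_j | parent values); all numbers are made up.
cpt = {
    0: {(): [0.6, 0.4]},
    1: {(0,): [0.9, 0.1], (1,): [0.2, 0.8]},
    2: {(0,): [0.7, 0.3], (1,): [0.5, 0.5]},
}

def joint_probability(x):
    """P(x_1, ..., x_n) as the product of P(x_j | values of the parents of X_j)."""
    p = 1.0
    for j, pa in parents.items():
        pa_values = tuple(x[i] for i in pa)
        p *= cpt[j][pa_values][x[j]]
    return p

# The factorized measure is normalized over all 2^3 assignments.
total = sum(joint_probability(x) for x in product([0, 1], repeat=3))
print(joint_probability((0, 0, 0)), total)
```

Only the conditional tables of each node given its parents are stored, which is the sense in which the graph fixes the free parameters of the measure.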

Of course a graph G does not necessarily coincide with the true causal structure when 
the measure is Markovian relative to G. We do not focus on the deep problem of inferring 
causal structure from statistics (Pearl, 2000). Here we mentioned the causal point of view 
only to emphasize that simple Bayesian networks may stem from a simple causal structure. 
The goal in this article is not to infer the causal structure but rather to infer properties of 
the probability measure from sparse data. 

It is interesting to note that the graph $G$ directly determines the free parameters of those
probability measures that are Markovian relative to $G$, since the probability measure $P$ is
determined once the "transition probabilities" $P(x_j \mid x_{j_1}, x_{j_2}, \dots, x_{j_{k_j}})$ are given. Once
we have found a hypothetical graph with corresponding transition probabilities that seems
to be in good agreement with the observed data, we would like to judge whether we have
really found a good model or whether the good agreement is rather caused by over-fitting
our limited amount of data. In (Wocjan and Janzing, 2002) upper bounds on the required
sample size for learning the probabilities of all $k$-tuples $(x_{j_1}, x_{j_2}, \dots, x_{j_k})$ with a certain
accuracy and reliability are given. Under the assumption that the true probability measure
is Markovian relative to a simple graph this will give the joint measure on the $n$ variables
up to a certain accuracy. Here we do not make any prior assumptions on the underlying
joint probabilities. The mere fact that we have found a simple model that explains the
data shall give the accuracy and reliability of our guess.

For each node $X_j$, the number of free parameters increases exponentially with the number
$k_j$ of parents of $X_j$. Therefore it seems reasonable to bound the number
of free parameters of the available probability measures by considering graphs with small
in-degree (i.e., the maximal number of parents) in order to avoid over-fitting. However,
this is a heuristic argument. In the context of learning theory, Vapnik (1998) has argued
that a small number of free parameters of the set of available functions to fit observed data
is neither sufficient nor necessary to avoid over-fitting. He showed that the so-called
Vapnik-Chervonenkis (VC) dimension of a set of functions is decisive. We will show, however, that
it does make sense from the point of view of learning theory to consider graphs with small
in-degree, since we can derive upper bounds on the VC dimension of the set of corresponding
probability measures.

In Section 2 we formulate the criterion for judging whether a hypothetical measure fits
the observed data well and make precise what defines a "good guess" of statistical
dependencies, i.e., we define a risk functional quantifying the goodness of fit. In Section 3
we rephrase the concept of VC dimension and explain the general idea of using it to obtain
reliability bounds when an unknown function is to be learned. In our case the unknown
function is the joint probability distribution on $n$ random variables. Therefore we derive in
Section 4 bounds on the VC dimension of sets of joint distributions corresponding to given
graphs and sets of graphs. We show how to obtain reliability bounds for the estimated
distribution. In Section 5 we show how to apply the structural risk minimization principle in
order to learn Bayesian networks reliably. In Section 6 we implement the minimization as
a convex optimization problem.

2. Selection criterion for hypothetical networks 

Assume the data are given by $l$ $n$-tuples

$$x^1, x^2, \dots, x^l .$$

Using prior knowledge on the underlying causal structure we might prefer a specific graph
$G$. It should have the property that a probability distribution in $\mathcal{P}_G$ already describes the
observed data very well. In order to find the best distribution in $\mathcal{P}_G$ we use the following
approach (Vapnik, 1998). Consider the observed relative frequencies $H(x)$ formally as a
probability measure over the set of possible $n$-tuples $\Omega := \Omega_1 \times \Omega_2 \times \dots \times \Omega_n$ and minimize
the Kullback-Leibler relative entropy (Cover and Thomas, 1991)

$$K(P|H) := -\sum_{x \in \Omega} H(x) \ln P(x) + \sum_{x \in \Omega} H(x) \ln H(x) ,$$

which is equivalent to the minimization of

$$R_{emp}(P) := -\sum_{x \in \Omega} H(x) \ln P(x) = -\frac{1}{l} \sum_{i \le l} \ln P(x^i) \qquad (2)$$

over all $P \in \mathcal{P}_G$. By the law of large numbers $R_{emp}(P)$ converges to

$$R(P) := -\sum_{x \in \Omega} \tilde{P}(x) \ln P(x) ,$$

where $\tilde{P}$ is the true probability measure on $\Omega$. Note that $R(P)$ yields for $P = \tilde{P}$

$$R(\tilde{P}) = -\sum_{x \in \Omega} \tilde{P}(x) \ln \tilde{P}(x) = S(\tilde{P}) ,$$

i.e., the entropy of $\tilde{P}$. The risk $R(P)$ measures the quality of the hypothesis concerning the
statistical dependencies since it measures whether those events that have been predicted to be rather
unlikely by the hypothetical measure do really occur rarely. Indeed, $R(P)$ is the sum of the
Kullback-Leibler distance between $P$ and $\tilde{P}$ and the entropy $S(\tilde{P})$. Therefore a low value of $R(P)$
does not only mean that the Kullback-Leibler distance between $P$ and the true measure
$\tilde{P}$ is low but it also implies that the entropy of $\tilde{P}$ is low. This means that we have found
strong statistical dependencies. They may, for instance, indicate strong causal influences
among the variables. This justifies considering $R(P)$ as a criterion that measures whether
the hypothetical measure $P$ is not only good in the sense that its deviation from $\tilde{P}$ is small
but also in the sense that we have found a law with high predictive power.

The minimization above is quite convenient since the factorization in eq. (1) corresponds
to a sum of the logarithms of the conditional probabilities. It is clear that this minimization
does not make sense if $G = G_c$ is the complete acyclic graph for a given order. Since all
measures are in $\mathcal{P}_{G_c}$ we would clearly obtain $P = H$, a fatal over-fit. This example
shows intuitively that the minimization above leads to over-fitting when $G$ has too many
arrows. In order to consider this problem from the perspective of statistical learning theory
we rephrase the essential concepts in the next section.
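The over-fit obtained for the complete graph can be made concrete with a small sketch. It assumes the empirical risk has the form $-\frac{1}{l}\sum_i \ln P(x^i)$; the sample and the helper names `empirical_risk`, `H` and `uniform` are made up for illustration.

```python
import math
from collections import Counter

def empirical_risk(P, data):
    """R_emp(P) = -(1/l) * sum_i ln P(x^i) for a list of l observed tuples."""
    return -sum(math.log(P(x)) for x in data) / len(data)

# Made-up sample of l = 4 observations of two binary variables.
data = [(0, 0), (0, 0), (1, 1), (0, 1)]
freq = Counter(data)

def H(x):
    """Observed relative frequencies, viewed as a probability measure."""
    return freq[x] / len(data)

def uniform(x):
    """Uniform measure on the four possible pairs."""
    return 0.25

print(empirical_risk(H, data))        # equals the entropy of H
print(empirical_risk(uniform, data))  # equals ln 4
print(H((1, 0)))                      # the unseen pair gets probability zero
```

The relative frequencies achieve the smallest empirical risk but assign probability zero to every unobserved tuple, so a single new observation of such a tuple would make the true risk infinite.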

3. Risk estimation by statistical learning theory 

As explained above, a major problem in inferring the true probability measure $\tilde{P}$ from the
set of training data is over-fitting. It is not sufficient that $R_{emp}(P)$ is small; rather, we would
like $R(P)$ to be small. Abstractly speaking the problem reads: Given a family $(f_\alpha)$ of
negative functions, consider the data points $x^1, x^2, \dots, x^l$ and choose $f_\alpha$ in such a way that
one can expect with high confidence that

$$R(f_\alpha) := -E f_\alpha(x) \qquad (3)$$

is small. In this general setting the specific form of $f_\alpha$ is not relevant; the problem is simply
to choose a function $f_\alpha$ from a family $(f_\alpha)$ such that its expectation value is maximal.
Statistical learning theory tells us the following. If the family $(f_\alpha)$ is small enough with respect
to a specific measure we can say with high confidence that for all $\alpha$ the risk $R(f_\alpha)$ deviates
from the empirical risk

$$R_{emp}(f_\alpha) := -\frac{1}{l} \sum_{j \le l} f_\alpha(x^j) \qquad (4)$$

only by a small amount. (Note the slight abuse of notation. To be consistent, we should
have written $R_{emp}(\ln P)$ and $R(\ln P)$ in Section 2. However, this should not lead to any
confusion.) To make this precise we briefly explain the notion of VC dimension. First we
introduce it only for two-valued functions ("indicator functions", or "classifiers").

Definition 3. Let $\Lambda$ be an index set of arbitrary cardinality. Let $(f_\alpha)_{\alpha \in \Lambda}$ be a set of
indicator functions on $\Omega$. Then the VC dimension of $(f_\alpha)_{\alpha \in \Lambda}$ is the largest natural number
$l$ such that there exist $l$ points $x^1, x^2, \dots, x^l \in \Omega$ with the property that for every indicator
function $\chi : \{x^1, x^2, \dots, x^l\} \to \{0, 1\}$ there exists a function $f_\alpha$ such that its restriction to
$\{x^1, x^2, \dots, x^l\}$ coincides with $\chi$.
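For a finite family of classifiers on a finite domain, the definition can be checked by brute force. The following is a minimal sketch under that finiteness assumption; the function names and the threshold family are hypothetical examples, not objects from the paper.

```python
from itertools import combinations

def shatters(functions, points):
    """True if the restrictions of `functions` to `points` realize all 2^len(points) labelings."""
    patterns = {tuple(f(x) for x in points) for f in functions}
    return len(patterns) == 2 ** len(points)

def vc_dimension(functions, domain):
    """Largest l such that some l-subset of the finite `domain` is shattered."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(functions, subset) for subset in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return best

# Hypothetical family: threshold classifiers x -> [x >= t] on a small grid.
domain = [0, 1, 2, 3]
thresholds = [lambda x, t=t: int(x >= t) for t in range(0, 5)]
print(vc_dimension(thresholds, domain))
```

Threshold classifiers can label any single point both ways but can never label a smaller point 1 and a larger point 0, so no two-point set is shattered.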

The definition of VC dimension of arbitrary real-valued functions relies on the VC
dimension of sets of indicator functions:

Definition 4. Let $(f_\alpha)_{\alpha \in \Lambda}$ be a family of real-valued functions on a set $\Omega$. Then the
VC dimension of $(f_\alpha)_{\alpha \in \Lambda}$ is the VC dimension of the family of the indicator functions
("classifiers") $(\chi_\mu \circ f_\alpha)_{\mu \in \mathbb{R}, \alpha \in \Lambda}$. Here $\chi_\mu$ is defined by $\chi_\mu(c) = 0$ for $c < \mu$ and $\chi_\mu(c) = 1$
for $c \ge \mu$.

The following theorem is a corollary of the statements in (Vapnik, 1998, pp. 192, end
of Section 5.3):

Theorem 5. Let $(f_\alpha)$ be a set of measurable real-valued functions on $\Omega$ bounded below
and above by $A$ and $B$, respectively. Let $h$ be the VC dimension of the set.

Then for any training data $x^1, x^2, \dots, x^l$ we have with probability at least $1 - \eta$

$$R(f_\alpha) \le R_{emp}(f_\alpha) + \phi_{A,B}(l, h, \eta)$$

with

$$\phi_{A,B}(l, h, \eta) := (B - A) \sqrt{\frac{h (\ln(2l/h) + 1) - \ln(\eta/4) + 1}{l}}$$

for all functions $f_\alpha$.

Note that the reliability bound is uniform on the family $(f_\alpha)$, i.e., with probability $1 - \eta$
the difference between $R_{emp}(f_\alpha)$ and $R(f_\alpha)$ is bounded for all $f_\alpha$ by the second term on the
right-hand side of the inequality in Theorem 5.
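To get a feeling for the size of the confidence term, the following sketch evaluates it numerically. It assumes the form $(B - A)\sqrt{(h(\ln(2l/h) + 1) - \ln(\eta/4) + 1)/l}$; the function name `confidence_term` and the sample values of $l$, $h$ and $\eta$ are chosen only for illustration.

```python
import math

def confidence_term(l, h, eta, A=0.0, B=1.0):
    """(B - A) * sqrt((h * (ln(2l/h) + 1) - ln(eta/4) + 1) / l), assumed form of phi_{A,B}."""
    inside = h * (math.log(2 * l / h) + 1) - math.log(eta / 4) + 1
    return (B - A) * math.sqrt(inside / l)

# The term shrinks as the sample size l grows and increases with the VC dimension h.
for l in (10**3, 10**4, 10**5):
    print(l, round(confidence_term(l, h=50, eta=0.05), 3))
```

Under this assumed form, the term decays roughly like $\sqrt{h \ln(l)/l}$, so the sample size has to exceed the VC dimension by a clear margin before the bound becomes informative.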

In the following section we will give bounds on the VC dimension of certain sets of joint 
probability distributions of n variables. 

4. The VC dimension associated with a graph or a set of graphs 

The factorization of Markovian joint distributions in eq. (1) is decisive for the upper bound
on the VC-dimension of $\mathcal{P}_G$. Note that the VC-dimensions of the families

$$(P)_{P \in \mathcal{P}_G}$$

and

$$(\ln P)_{P \in \mathcal{P}_G}$$

coincide. Note furthermore that $\mathcal{P}_G$ also contains all distributions that are Markovian
relative to a graph $G'$ whenever $G'$ is obtained from $G$ by deleting some arrows. In this
sense, one always considers a set of graphs when general distributions in $\mathcal{P}_G$ are considered.
Let $m_j$ be the number of elements of $\Omega_j$. We find:

Theorem 6. Let $r_j$ be the indices of the set $P_j$ of parents of $X_j$ with respect to a given
graph $G$. Then the VC-dimension of $\mathcal{P}_G$ is at most

$$N_G := \sum_{j \le n} \prod_{i \in (r_j \cup \{j\})} m_i . \qquad (5)$$

Proof. We show that the logarithms of all probability distributions in $\mathcal{P}_G$ can be
written as linear functionals on a common $N_G$-dimensional vector space. For each set
$\mathbf{j} = \{j_1, \dots, j_k\} \subseteq \{1, \dots, n\}$ we define

$$\Omega_{\mathbf{j}} := \Omega_{j_1} \times \Omega_{j_2} \times \dots \times \Omega_{j_k} .$$

Then we define the vector space $V_j$ as the set of real-valued functions on $\Omega_{r_j \cup \{j\}}$.

The dimension of $V_j$ is clearly given as

$$\prod_{i \in (r_j \cup \{j\})} m_i .$$

By setting

$$V := \oplus_{j \le n} V_j$$

we obtain a vector space of dimension $N_G$. Now we define a vector $f_j \in V_j$ by

$$f_j(x_j, p_j) := \ln P(x_j \mid p_j)$$

and

$$f := \oplus_{j \le n} f_j .$$

For each $n$-tuple $x := (x_1, \dots, x_n)$ we define a vector

$$c^x := \oplus_j c_j^x \in V$$

where each vector $c_j^x$ is 1 for the entry that corresponds to the restriction of $x$ to $r_j \cup \{j\}$
and 0 for all the other values. Then the logarithm of the probability of $x$ can be obtained
by

$$\ln P(x_1, \dots, x_n) = \langle c^x | f \rangle .$$

This shows that the logarithm can be written as a linear functional in $V$. The VC dimension
of the set of linear functions in $\mathbb{R}^N$ is $N$ (Vapnik, 1998). This completes the proof.
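The construction in the proof can be checked numerically on a toy example. The sketch below assumes $N_G = \sum_j \prod_{i \in r_j \cup \{j\}} m_i$; the graph, the value-set sizes and the conditional probabilities produced by `cond_prob` are made up, and all helper names are hypothetical.

```python
import math
from itertools import product

m = [2, 3, 2]                          # sizes m_j of the value sets (made up)
parents = {0: [], 1: [0], 2: [0, 1]}   # r_j for each node j (0-based)

def n_g(m, parents):
    """N_G = sum over nodes j of the product of m_i over i in r_j united with {j}."""
    return sum(math.prod(m[i] for i in pa + [j]) for j, pa in parents.items())

def cond_prob(j, xj, pa_values):
    """Made-up strictly positive conditional probabilities P(x_j | parent values)."""
    weights = [1 + v + sum(pa_values) for v in range(m[j])]
    return weights[xj] / sum(weights)

# One coordinate of V for every node j and every joint value of (parents, x_j).
coords = [(j, vals) for j, pa in parents.items()
          for vals in product(*(range(m[i]) for i in pa + [j]))]
index = {c: k for k, c in enumerate(coords)}

# The vector f collects ln P(x_j | p_j) for all coordinates.
f = [math.log(cond_prob(j, vals[-1], vals[:-1])) for j, vals in coords]

def c_vector(x):
    """Indicator vector c^x: exactly one entry equal to 1 per node, all others 0."""
    c = [0.0] * len(coords)
    for j, pa in parents.items():
        c[index[(j, tuple(x[i] for i in pa + [j]))]] = 1.0
    return c

def log_joint(x):
    """ln P(x) computed directly from the factorization."""
    return sum(math.log(cond_prob(j, x[j], tuple(x[i] for i in pa)))
               for j, pa in parents.items())

x = (1, 2, 0)
inner = sum(ci * fi for ci, fi in zip(c_vector(x), f))
print(len(coords), n_g(m, parents), abs(inner - log_joint(x)) < 1e-12)
```

The number of coordinates equals $N_G$, and the inner product with the indicator vector reproduces the log-probability, which is the linearity used in the proof.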

The idea of the proof is quite similar to the proof of Lemma 2 in (Herrmann and Janzing,
2003). There we have given an upper bound on the VC dimension of the set of so-called
$k$-factor log-linear models. These are probability distributions with the property that their
logarithm can be written as a sum of functions depending on $k$ variables only. Here we have
considered a specific factorization corresponding to a given graph. This prior knowledge
decreases the VC-dimension.

Now we consider the case that no specific graph is given but all graphs with a given
in-degree are allowed which respect a given order on the set of random variables. We find:

Theorem 7. Let $X_1 < X_2 < \dots < X_n$ be an ordering on the set of random
variables. Let $\mathcal{P}_\Delta$ be the set of all measures that are Markovian relative to an appropriate
graph with in-degree $\Delta$ which is consistent with the order, i.e., there are only arrows from
$X_i$ to $X_j$ for $i < j$. Then the VC dimension of $\mathcal{P}_\Delta$ is at most

$$N_\Delta := \sum_{j=1}^{n} \sum_{\mathbf{i}} m_j m_{i_1} \cdots m_{i_\Delta} , \qquad (6)$$

where the second sum runs over all $\Delta$-subsets $\mathbf{i} := \{i_1, i_2, \dots, i_\Delta\}$ of $\{1, \dots, j-1\}$.

Proof. We extend the proof of Theorem 6. The definition of each space $V_j$ given there
depends on one particular choice of the parents of $X_j$. Now we have a vector space $V_j^{\mathbf{i}}$
corresponding to each possible choice of parents of $X_j$, given by one specific $\Delta$-subset $\mathbf{i}$ for
each node $j$. We define

$$\tilde{V} := \oplus_{j \le n} \oplus_{\mathbf{i}} V_j^{\mathbf{i}} .$$

One checks easily that $N_\Delta$ is the dimension of $\tilde{V}$. Note furthermore that the definition
of each $c_j^x$ in the proof of Theorem 6 also depends on one specific choice $\mathbf{i}$ of the parents of $X_j$.
Hence we now obtain a different vector $c_j^{x,\mathbf{i}}$ for each $\mathbf{i}$. In analogy to the proof of Theorem
6 we assign a vector $\tilde{c}^x$ to each $n$-tuple $x = (x_1, \dots, x_n)$ by

$$\tilde{c}^x := \oplus_j \oplus_{\mathbf{i}} c_j^{x,\mathbf{i}} .$$

Let $P$ be an arbitrary probability measure in $\mathcal{P}_\Delta$. The proof of Theorem 6 assigns a vector
$f \in V$ to this measure. Note that there is a canonical embedding of the vector space $V$
introduced in the proof of Theorem 6 into the space $\tilde{V}$ defined here, since each $V_j$ defined
there corresponds to one specific $V_j^{\mathbf{i}}$ here. With this embedding we have

$$\ln P(x_1, \dots, x_n) = \langle \tilde{c}^x | f \rangle .$$

Hence the logarithms of probabilities can be written as an inner product in a vector space
of dimension $N_\Delta$. This completes the proof.

If no specific order on the random variables is given a priori the VC dimension of all 
graphs with a fixed in-degree is bounded as follows: 

Theorem 8. Let $\mathcal{P}_\Delta$ be the set of probability measures that are Markovian relative to
some graph $G$ with in-degree $\Delta$. Then the VC dimension of $\mathcal{P}_\Delta$ is at most

$$N_\Delta := \sum_{\mathbf{j}} \prod_{i \in \mathbf{j}} m_i , \qquad (7)$$

where the sum runs over all $(\Delta + 1)$-subsets $\mathbf{j}$ of $\{1, 2, \dots, n\}$.

The proof follows from the observation that each $P \in \mathcal{P}_\Delta$ is a $(\Delta + 1)$-factor log-linear
model, i.e., a probability distribution with the property that its logarithm can be written as
a sum of functions each depending on at most $\Delta + 1$ variables. Then the bound of Lemma
2 in (Herrmann and Janzing, 2003) applies.

The following corollary of Theorem 5 shows explicitly how to use the bounds on the
VC-dimensions in order to get reliability bounds on the estimated risk functional. Note that


8 D. Janzing and D. Herrmann 



it is therefore necessary to restrict one's attention to sets of probability measures which are
bounded below. Explicitly, we define: Let $\mathcal{P}^\lambda$ for each $\lambda > 0$ be the set of joint distributions
$P$ with the property that

$$P(x_1, x_2, \dots, x_n) \ge \lambda$$

for all $n$-tuples in $\Omega$. Note that we do not assume that the true probability measure $\tilde{P}$
satisfies this requirement. Only the hypothetical measure $P$ has to be bounded. Then the
bounds $A$ and $B$ in Theorem 5 are $0$ and $-\ln \lambda$, respectively. We conclude:

Corollary 1. Let $\mathcal{P} \subseteq \mathcal{P}^\lambda$ be a set of joint distributions with VC-dimension $h$. Then
for any training data $x^1, x^2, \dots, x^l$ we have with probability at least $1 - \eta$

$$R(P) \le R_{emp}(P) + \phi_\lambda(l, h, \eta)$$

with

$$\phi_\lambda(l, h, \eta) := (-\ln \lambda) \sqrt{\frac{h (\ln(2l/h) + 1) - \ln(\eta/4) + 1}{l}} \qquad (8)$$

uniformly for all $P \in \mathcal{P}$.

Setting $\mathcal{P} := \mathcal{P}_G$ we obtain reliability bounds on the estimated probability measure
provided that the graph $G$ has been chosen in advance. With $\mathcal{P} := \mathcal{P}_\Delta$ we obtain reliability
bounds if the hypothetical measures are restricted to those that factorize according to a "simple"
graph (in the sense of small in-degree).

However, the prior restriction to a specific $\lambda$ and a specific graph or to graphs with
small in-degree is not acceptable. An appropriate way to learn Bayesian networks should
also consider complex graphs provided that a sufficiently large sample strongly indicates
a more complicated dependency among the variables. Similarly, one should not a priori
exclude probabilities that are smaller than a specific value $\lambda$. For large sample sizes the data
may give strong evidence that some probabilities are indeed small. On the other hand, the
estimation in Corollary 1 seems to require prior restrictions.

This problem is solved by the structural risk minimization principle (Vapnik, 1995 and 1998)
in statistical learning theory. It uses a hierarchy of increasing sets of hypothetical functions.
Then a function $g$ from a larger set is only preferred to a function $f$ from a smaller
set if not only $R_{emp}(g) < R_{emp}(f)$ but also the bound on $R(g)$ is smaller than the bound
on $R(f)$. We explain this principle in the following section.

Now we briefly summarize the estimates of the VC dimension for some interesting sets
of graphs. Here we assume that $m_{\max}$ is the maximum over all values $m_j$.

• For the VC dimension of a given graph $G$ with in-degree $\Delta$ we have

$$h \le n \, m_{\max}^{\Delta+1} .$$

This follows from Theorem 6 since

$$\prod_{i \in (r_j \cup \{j\})} m_i \le m_{\max}^{\Delta+1} .$$

• For the VC dimension of all graphs with in-degree $\Delta$ which respect a given order on
the nodes we have

$$h \le m_{\max}^{\Delta+1} \sum_{j=1}^{n} \binom{j-1}{\Delta} .$$

This is due to Theorem 7 since the second sum in eq. (6) runs over $\binom{j-1}{\Delta}$ terms.

• For the VC dimension of the set of all graphs with in-degree $\Delta$ we have

$$h \le m_{\max}^{\Delta+1} \binom{n}{\Delta+1} .$$

This follows from Theorem 8 since the sum in eq. (7) runs over $\binom{n}{\Delta+1}$ terms.



This seems to suggest that in general a small in-degree reduces the VC dimension
considerably whereas prior knowledge on the causal order is less relevant.
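The three coarse bounds can be compared numerically. In the sketch below, `m_max` stands for the maximum of the $m_j$ and `delta` for the in-degree; the function names are hypothetical, and the formulas assume that the counts of $\Delta$-subsets of $\{1, \dots, j-1\}$ and of $(\Delta+1)$-subsets of $\{1, \dots, n\}$ are the binomial coefficients $\binom{j-1}{\Delta}$ and $\binom{n}{\Delta+1}$.

```python
from math import comb

def bound_given_graph(n, m_max, delta):
    """Coarse bound for one fixed graph: n * m_max^(delta + 1)."""
    return n * m_max ** (delta + 1)

def bound_given_order(n, m_max, delta):
    """Coarse bound for all graphs respecting a fixed order on the nodes."""
    return m_max ** (delta + 1) * sum(comb(j - 1, delta) for j in range(1, n + 1))

def bound_all_graphs(n, m_max, delta):
    """Coarse bound for all graphs with the given in-degree."""
    return m_max ** (delta + 1) * comb(n, delta + 1)

for n in (5, 10, 20):
    print(n, bound_given_graph(n, 2, 2), bound_given_order(n, 2, 2), bound_all_graphs(n, 2, 2))
```

By the hockey-stick identity $\sum_{j=1}^{n} \binom{j-1}{\Delta} = \binom{n}{\Delta+1}$, the coarse bounds with and without a given order coincide, which illustrates why knowledge of the order does not help at this level of approximation.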

5. Structural risk minimization 

Before we apply structural risk minimization to the problem of learning probabilities we
briefly sketch the general idea. Consider the case that an arbitrary function on a set $\Omega$ is
to be learned. Define a sequence $(F_k)_{k \in \mathbb{N}}$ of families $F_k$ of functions. The idea is that the
sequence defines a hierarchy of more and more complex families of functions and the less
complex ones are a priori preferred. Let $(p_k)_{k \in \mathbb{N}}$ be any sequence of non-negative numbers
with $\sum_k p_k = 1$. These values express to what extent one tends to prefer functions from
$F_k$ with lower $k$. Let $h_k$ be the VC-dimension of $F_k$. Then one has with probability $1 - \eta$
that for each function $f \in \bigcup_k F_k$

$$R(f) \le R_{emp}(f) + \phi(l, h_k, p_k \eta) , \qquad (9)$$

where $k$ is such that $f \in F_k$ and $\phi$ is the confidence term in eq. (8). This is a standard union
bound argument (see e.g. Herbrich, 2002). Note that the sequence $p_k$ may be chosen in such a way
that it expresses prior probabilities for the choice of a certain class $F_k$. But it should be
emphasized that the reliability bound in eq. (9) does not rely on this interpretation.

Here we define a hierarchy of probability measures which takes into account two aspects of
a measure: We prefer measures which are Markovian relative to a simple graph and measures
with high cut-off value $\lambda$. Let $(\lambda_m)_{m \in \mathbb{N}}$ be a sequence of positive values converging to zero.
Let $\mathcal{P}^{\lambda_m}$ be the set of probability measures bounded from below by $\lambda_m$. Let $\mathcal{P}_1, \dots, \mathcal{P}_r$ be
$r$ sets of probability measures. They may, for instance, correspond to an enumeration of
all directed acyclic graphs on $n$ nodes. They may also correspond to graphs with in-degree
$1, 2, \dots, r$. Then we prefer probability measures in $\mathcal{P}_k \cap \mathcal{P}^{\lambda_m}$ for small $m$ and small $k$.
We may express this by defining probabilities $q_{k,m}$ which are decreasing in $k$ and $m$. In
analogy to the bound above we obtain:

Theorem 9. Let $(\mathcal{P}_k)_{k \in K}$ with $K = \mathbb{N}$ or $K = \{1, \dots, r\}$ be an arbitrary set of families
of joint distributions on the $n$ random variables.

Let $(q_{k,m})_{k,m}$ define an arbitrary probability measure on $K \times \mathbb{N}$. Let $h_k$ be the
VC-dimension of $\mathcal{P}_k$. Then with probability $1 - \eta$, for all $k \in K$, $m \in \mathbb{N}$ and all
$P \in \mathcal{P}_k \cap \mathcal{P}^{\lambda_m}$,

$$R(P) \le R_{emp}(P) + \phi_{\lambda_m}(l, h_k, q_{k,m} \eta)$$

holds, with

$$\phi_{\lambda_m}(l, h_k, q_{k,m} \eta) := (-\ln \lambda_m) \sqrt{\frac{h_k (\ln(2l/h_k) + 1) - \ln(q_{k,m} \eta / 4) + 1}{l}} .$$

The structural risk minimization principle works as follows. For given numbers $k, m$
choose $P_{k,m} \in \mathcal{P}^{\lambda_m} \cap \mathcal{P}_k$ such that $R_{emp}(P_{k,m})$ is minimal. Then choose $k, m$ such that

$$R_{emp}(P_{k,m}) + \phi_{\lambda_m}(l, h_k, q_{k,m} \eta)$$

is minimal.

The following example gives an idea of how to apply this principle. Consider $n$ binary
variables. Then our upper bound on the VC dimension of the set of graphs with in-degree $\Delta$
is

$$h_\Delta \le n 2^{\Delta+1} .$$

Let $\mathcal{P}_k$ be the set of joint distributions which are Markovian relative to a graph with
in-degree $k$. Set furthermore $\lambda_m := 2^{-m}$ and choose the prior probability measure on $\mathbb{N} \times \mathbb{N}$
as

$$q_{k,m} := 2^{-k-m} .$$

For $P \in \mathcal{P}_k \cap \mathcal{P}^{\lambda_m}$ we obtain

$$\phi_{\lambda_m}(l, h_k, q_{k,m} \eta) = m \ln 2 \sqrt{\frac{n 2^{k+1} (\ln(2l/(n 2^{k+1})) + 1) - \ln(\eta/4) + (k + m) \ln 2 + 1}{l}} .$$

The confidence term grows exponentially in $k$, with $O(\sqrt{n})$ and with $O(m^{3/2})$ whenever the
other parameters are fixed. Hence the required sample size grows quickly with the in-degree,
whereas the number of random variables is less decisive. Also the cut-off value $\lambda$ of the
probabilities is less decisive since the required sample size grows only with $O(m^{3/2})$ although
we have defined the cut-off values $\lambda_m$ in such a way that they decrease exponentially in $m$.
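The selection step of structural risk minimization in this example can be sketched as follows. The sketch assumes the confidence term $m \ln 2 \sqrt{(h_k(\ln(2l/h_k) + 1) - \ln(\eta/4) + (k+m)\ln 2 + 1)/l}$ with $h_k = n 2^{k+1}$; the empirical risks in `emp_risk` are invented numbers standing in for the minimized $R_{emp}(P_{k,m})$, and the function names are hypothetical.

```python
import math

def confidence_term(l, n, k, m, eta):
    """Assumed confidence term for h_k = n * 2^(k+1), lambda_m = 2^-m, q_{k,m} = 2^-(k+m)."""
    h = n * 2 ** (k + 1)
    inside = h * (math.log(2 * l / h) + 1) - math.log(eta / 4) + (k + m) * math.log(2) + 1
    return m * math.log(2) * math.sqrt(inside / l)

def select_model(emp_risk, l, n, eta):
    """Return the pair (k, m) minimizing empirical risk plus confidence term."""
    return min(emp_risk,
               key=lambda km: emp_risk[km] + confidence_term(l, n, km[0], km[1], eta))

# Invented empirical risks of the best-fitting measure in each class (k, m).
emp_risk = {(1, 4): 6.9, (2, 4): 6.1, (3, 4): 6.0}
for l in (10**3, 10**6):
    print(l, select_model(emp_risk, l=l, n=10, eta=0.05))
```

With these invented risks the smaller sample favors in-degree 2, because the slightly better fit of in-degree 3 does not outweigh its larger confidence term, while the larger sample favors in-degree 3.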



6. Convex Optimization for Bayesian networks 

The number of directed acyclic graphs with constant in-degree $\Delta$ and $n$ nodes increases
polynomially in $n$. Therefore it is realistic to assume that for all graphs with small
in-degree ("sparse graphs") the optimization can be carried out for each graph. Hence we may
restrict our attention to finding the optimal probability measure that is Markovian relative
to a given graph $G$ and bounded by a given value $\lambda$ from below. Let $V_j$ be defined as in the
proof of Theorem 6, i.e., the set of real-valued functions on

$$\Omega_{r_j \cup \{j\}} .$$

Let $x^i$ be the $i$-th observed $n$-tuple. Let $x^i|_{r_j \cup \{j\}}$ be its restriction to the variable $X_j$ and all
its parents. Then the task is to find a vector

$$f = \oplus_j f_j \in \oplus_j V_j = V$$

that minimizes

$$R_{emp}(f) := -\frac{1}{l} \sum_{i \le l} \sum_{j \le n} f_j(x^i|_{r_j \cup \{j\}})$$

subject to the following constraints:

(a) For each $j$ the sum of the conditional probabilities $P(x_j \mid \omega)$ over all $x_j \in \Omega_j$ has to be
1 for all $\Delta$-tuples $\omega \in \Omega_{r_j}$. Formally this means

$$Z_\omega(f_j) := \sum_{x_j \in \Omega_j} \exp(f_j(x_j, \omega)) = 1 \qquad (10)$$

for $\omega \in \Omega_{r_j}$.

(b) No probability $P(x_1, \dots, x_n)$ is less than $\lambda$. We can achieve this by stating the stronger
constraint that no transition probability $P(x_j \mid \omega)$ is less than $\lambda^{1/n}$. This is equivalent
to

$$G_{\omega, j, x_j}(f) := f_j(x_j, \omega) \ge \frac{1}{n} \ln \lambda .$$

The optimization is rather similar to the one in (Herrmann and Janzing, 2003), with the
decisive difference that the normalization can be performed for each node separately here,
whereas the normalization condition for the joint measure on $n$ variables involves a sum over
all possible $n$-tuples, i.e., a number growing exponentially in $n$. Here the computational
complexity grows only polynomially in $n$ for constant in-degree $k$. The number of constraints grows
linearly in $n$ but exponentially in $k$. The number of terms in the sum (10) also grows
exponentially in $k$. But since we assume that $k$ is small we consider the optimization as
computationally tractable. Due to the convexity of the constraints (see Herrmann and
Janzing, 2003) it is a standard convex programming problem that can be solved efficiently
(Pallaschke and Rolewicz, 1997).
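Because both the empirical risk and the constraints (a) and (b) decouple across nodes $j$ and parent configurations $\omega$, each block of transition probabilities can be fitted separately. The following sketch solves one such block in closed form from its optimality conditions instead of calling a general convex solver; this shortcut, the function name `bounded_mle` and the example counts are assumptions of the sketch, not the implementation used in the paper.

```python
def bounded_mle(counts, eps):
    """Maximize sum_x counts[x] * ln p[x] subject to sum_x p[x] = 1 and p[x] >= eps."""
    size = len(counts)
    if size * eps > 1:
        raise ValueError("lower bound eps is infeasible for this many values")
    clipped = set()  # values whose probability is fixed at the lower bound eps
    while True:
        free = [x for x in range(size) if x not in clipped]
        mass = 1.0 - eps * len(clipped)      # probability left for unclipped values
        total = sum(counts[x] for x in free)
        if total == 0:                       # no observations among the free values
            p_free = {x: mass / len(free) for x in free}
        else:                                # rescaled relative frequencies
            p_free = {x: mass * counts[x] / total for x in free}
        too_small = [x for x in free if p_free[x] < eps - 1e-12]
        if not too_small:
            return [eps if x in clipped else p_free[x] for x in range(size)]
        clipped.update(too_small)

counts = [8, 2, 0]       # made-up counts of x_j for one parent configuration
lam, n = 1e-6, 4         # hypothetical cut-off lambda and number of variables
print(bounded_mle(counts, lam ** (1 / n)))
```

Without the lower bound the result reduces to the conditional relative frequencies; with it, rarely observed values are raised to $\lambda^{1/n}$ and the remaining mass is distributed in proportion to the counts.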

7. Conclusions 

We have presented a method for estimating the joint distribution of a large number of
random variables from sparse data. The statistical dependencies among the variables are
explained by Bayesian networks such that networks with simple graphs are preferred. We
provide reliability bounds without restricting the set of joint distributions under
consideration. We have shown that the set of probability measures that are Markovian relative to
simple graphs has low VC-dimension. This guarantees reliable estimation in the sense of
statistical learning theory whenever the observed data are explained well by those "simple
measures". If no simple Bayesian network fits the data the method does not allow reliable
estimation. Furthermore we have shown that finding the optimal distribution within a class
of distributions (Markovian relative to a given graph) is a convex optimization problem.
Since the number of simple graphs is not too large, the whole estimation can be performed
efficiently.